7.4 Compression
85
greatest for files with a lot of repetitive material, but according to van der Waerden’s
(1927) extension of Baudet’s conjecture, any string of two kinds of symbols has
repetitive sequences of at least one of the symbols.
7.4.1
Use of Compression to Measure Distance
Suppose two ergodic binary sources upper PP and upper QQ emit 1s with probabilities pp and qq,
respectively. The Kullback–Leibler (1951) relative entropy between the two strings
is
upper S Subscript upper P upper Q Baseline equals minus q log Subscript 2 Baseline StartFraction p Over q EndFraction minus left parenthesis 1 minus q right parenthesis log Subscript 2 Baseline StartFraction 1 minus p Over 1 minus q EndFractionSP Q = −q log2
p
q −(1 −q) log2
1 −p
1 −q
(7.8)
and may be used as the basis of a measure of distance between the two strings.
Benedetto et al. (2002) have devised an ingenious method for estimating upper S Subscript upper P upper QSP Q from
two sources by zipping a long string from each source (upper PP andupper QQ), to each of which,
prior to zipping, is appended a sufficiently short string fragment (say upper P primeP,) from one
of the sources. upper S Subscript upper P upper QSP Q is then the difference in coding efficiency between upper P primeP, coded
optimally because it follows upper PP (the source is ergodic) and upper P primeP, coded suboptimally
because it follows upper QQ. Using upper LL to denote the length of a zipped file,
upper S Subscript upper P upper Q Baseline equals left bracket left parenthesis upper L Subscript upper Q plus upper P prime Baseline minus upper L Subscript upper Q Baseline right parenthesis minus left parenthesis upper L Subscript upper P plus upper P prime Baseline minus upper L Subscript upper P Baseline right parenthesis right bracket divided by upper L prime Subscript upper P primeSP Q = [(L Q+P, −L Q) −(L P+P, −L P)]/L,
P,
(7.9)
(in bits per character), whereupper L prime Subscript upper P primeL,
P, is the unzipped length of the short string fragment
upper P primeP,. In order to eliminate dependency on the particular coding, a different normaliza-
tion may be used:
upper S Subscript upper P upper Q Baseline equals StartFraction left parenthesis upper L Subscript upper Q plus upper P Sub Superscript prime Subscript Baseline minus upper L Subscript upper Q Baseline right parenthesis minus left parenthesis upper L Subscript upper P plus upper P Sub Superscript prime Subscript Baseline minus upper L Subscript upper P Baseline right parenthesis Over upper L Subscript upper P plus upper P Sub Superscript prime Subscript Baseline EndFraction plus StartFraction left parenthesis upper L Subscript upper P plus upper Q Sub Superscript prime Subscript Baseline minus upper L Subscript upper P Baseline right parenthesis minus left parenthesis upper L Subscript upper Q plus upper Q Sub Superscript prime Subscript Baseline minus upper L Subscript upper Q Baseline right parenthesis Over upper L Subscript upper Q plus upper Q Sub Superscript prime Subscript Baseline EndFraction periodSP Q = (L Q+P, −L Q) −(L P+P, −L P)
L P+P,
+ (L P+Q, −L P) −(L Q+Q, −L Q)
L Q+Q,
.
(7.10)
7.4.2
Ergodicity
Ergodicity means that every allowable point in phase space is visited infinitely often
in infinite time or, in practice, every allowable point in phase space is approached
arbitrarily closely after a long time. Ergodicity is a pillar of Boltzmann’s assumption
that the microstates of an ensemble have equal a priori probabilities, and indeed of
the rest of statistical mechanics. Nevertheless, as our knowledge of the world has
increased, it has become apparent that ergodicity actually applies only to a small
minority of natural systems. Although some systems may not even be ergodic in
the infinite time limit, most observed departures from ergodicity occur because of
the inordinately long times that would be required to fulfil it. The departures are